3  Understanding and visualising variables

Author: Vladimir Buskin

Affiliation: Catholic University of Eichstätt-Ingolstadt

3.1 Preparation

Please download the file “Paquot_Larsson_2020_data.xlsx” (Paquot and Larsson 2020) and store it in your working directory.

# Libraries
library("nycflights13")
library("readxl")
library("tidyverse")
library("ggthemes")

# Load data
cl.order <- read_xlsx("Paquot_Larsson_2020_data.xlsx")

# Inspect data
str(cl.order)
tibble [403 × 8] (S3: tbl_df/tbl/data.frame)
 $ CASE       : num [1:403] 4777 1698 953 1681 4055 ...
 $ ORDER      : chr [1:403] "sc-mc" "mc-sc" "sc-mc" "mc-sc" ...
 $ SUBORDTYPE : chr [1:403] "temp" "temp" "temp" "temp" ...
 $ LEN_MC     : num [1:403] 4 7 12 6 9 9 9 4 6 4 ...
 $ LEN_SC     : num [1:403] 10 6 7 15 5 5 12 2 24 11 ...
 $ LENGTH_DIFF: num [1:403] -6 1 5 -9 4 4 -3 2 -18 -7 ...
 $ CONJ       : chr [1:403] "als/when" "als/when" "als/when" "als/when" ...
 $ MORETHAN2CL: chr [1:403] "no" "no" "yes" "no" ...
head(cl.order)
# A tibble: 6 × 8
   CASE ORDER SUBORDTYPE LEN_MC LEN_SC LENGTH_DIFF CONJ     MORETHAN2CL
  <dbl> <chr> <chr>       <dbl>  <dbl>       <dbl> <chr>    <chr>      
1  4777 sc-mc temp            4     10          -6 als/when no         
2  1698 mc-sc temp            7      6           1 als/when no         
3   953 sc-mc temp           12      7           5 als/when yes        
4  1681 mc-sc temp            6     15          -9 als/when no         
5  4055 sc-mc temp            9      5           4 als/when yes        
6   967 sc-mc temp            9      5           4 als/when yes        

3.2 Types of variables

The concept of the variable allows us to quantify various aspects of our observations. Variables are commonly classified by their scale of measurement:

  • nominal/categorical: These variables have a limited number of levels which cannot be ordered in a meaningful way. For instance, it does not matter which value of SUBORDTYPE or MORETHAN2CL comes first or last:

    unique(cl.order$SUBORDTYPE)
    [1] "temp" "caus"
    unique(cl.order$MORETHAN2CL)
    [1] "no"  "yes"
  • ordinal: Such variables can be ordered, but the intervals between their individual values are not meaningful. Heumann (2022: 6) provides a pertinent example:

    “[T]he satisfaction with a product (unsatisfied–satisfied–very satisfied) is an ordinal variable because the values this variable can take can be ordered but the differences between ‘unsatisfied–satisfied’ and ‘satisfied–very satisfied’ cannot be compared in a numerical way”.

  • In the case of interval-scaled variables, the differences between the values can be interpreted, but their ratios cannot. A temperature of 4°C is 6 degrees warmer than -2°C; however, the ratio of the two values (4 / -2 = -2) says nothing about how much warmer 4°C is. This is because the Celsius scale has no true zero point; 0°C simply signifies another point on the scale and not the absence of temperature altogether.

  • Ratio-scaled variables allow both a meaningful interpretation of the differences between their values and (!) of the ratios between them. Within the context of clause length, LENGTH_DIFF values such as 4 and 8 not only suggest that the latter is four units greater than the former but also that their ratio \(\frac{8}{4} = 2\) is a valid way to describe the relationship between these values. Here a LENGTH_DIFF of 0 can be clearly viewed as the absence of a length difference.
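The four scale types can be made concrete with a few lines of base R. This is only a minimal sketch: the vector names are chosen for illustration, the satisfaction levels are taken from the Heumann quotation, and the Kelvin conversion is added here to show why ratios on the Celsius scale are not interpretable.

```r
# Nominal: an unordered factor; the order of the levels carries no meaning
subordtype <- factor(c("temp", "caus", "temp"))

# Ordinal: an ordered factor; values can be compared, but not subtracted
satisfaction <- factor(
  c("satisfied", "unsatisfied", "very satisfied"),
  levels = c("unsatisfied", "satisfied", "very satisfied"),
  ordered = TRUE
)
satisfaction[1] > satisfaction[2]  # TRUE: "satisfied" ranks above "unsatisfied"

# Interval: differences survive a change of unit, ratios do not
celsius <- c(-2, 4)
kelvin <- celsius + 273.15
diff(celsius)            # 6
diff(kelvin)             # 6 (up to floating-point error)
celsius[2] / celsius[1]  # -2
kelvin[2] / kelvin[1]    # about 1.02

# Ratio: the values 4 and 8 from the text; their ratio is interpretable
length_diff <- c(4, 8)
length_diff[2] / length_diff[1]  # 2
```

Encoding an ordinal variable as an ordered factor is one common choice in R; plain character vectors, as in cl.order, work for nominal variables as well.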

3.3 Introduction to ggplot2

3.3.1 Building a ggplot

  • A ggplot requires at minimum three elements: (1) a data frame, (2) an aesthetic mapping that assigns variables to the axes, and (3) a plotting option (also known as “geom”). We combine them with the + sign.
# Supply data frame
ggplot(data = cl.order,
       # Map variables to the x- and y-axis
       mapping = aes(x = LEN_MC, y = LEN_SC)) +
  # Set plotting option (here: scatterplot)
  geom_point()
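To see what each element contributes, the plot can also be built step by step and stored in objects. The sketch below is self-contained and therefore uses a small toy data frame (the first four rows shown in the head() output above) rather than the full cl.order data; the object names toy, p and p_points are chosen for illustration.

```r
library(ggplot2)

# Toy data: LEN_MC and LEN_SC of the first four rows printed by head(cl.order)
toy <- data.frame(LEN_MC = c(4, 7, 12, 6),
                  LEN_SC = c(10, 6, 7, 15))

# Step 1: data frame and mapping only; printing p shows empty axes
p <- ggplot(data = toy, mapping = aes(x = LEN_MC, y = LEN_SC))

# Step 2: adding a geom with + creates the first layer
p_points <- p + geom_point()

length(p$layers)         # number of layers before adding the geom
length(p_points$layers)  # number of layers after adding the geom

# Printing the object draws the scatterplot
p_points
```

Storing intermediate plots like this is optional, but it makes it easy to try different geoms on the same data and mapping.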

3.3.2 Adding layers

  • Visualise a third variable using the color argument inside the aes() function.
ggplot(data = cl.order,
        mapping = aes(x = LEN_MC, 
                      y = LEN_SC,
                      color = ORDER)) +
        geom_point()

  • Adjust further visual parameters as you see fit:
# Map variables to axes, colours and shapes
ggplot(data = cl.order,
       mapping = aes(x = LEN_MC, y = LEN_SC)) +
  geom_point(aes(color = ORDER, shape = SUBORDTYPE)) +
  # Set the title, subtitle, axis labels and legend titles
  labs(
    title = "Length of main and subordinate clauses",
    subtitle = "Dimensions for different ordering types",
    x = "Length of main clause",
    y = "Length of subordinate clause",
    color = "ORDER",
    shape = "SUBORDTYPE"
  ) +
  # Change the overall theme of the plot
  theme_classic()

3.4 Visualising distributions

3.4.1 A categorical variable

  • Barplot with geom_bar()
ggplot(cl.order, aes(x = ORDER)) +
  geom_bar()
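geom_bar() counts the rows for each level of the x variable before drawing the bars. The same counts can be computed explicitly, which is a useful check on what the barplot displays. This sketch assumes dplyr is available (it is part of the tidyverse) and uses the six ORDER values printed by head(cl.order) as toy data.

```r
library(dplyr)

# Toy data: the ORDER values of the six rows printed by head(cl.order)
toy <- data.frame(
  ORDER = c("sc-mc", "mc-sc", "sc-mc", "mc-sc", "sc-mc", "sc-mc")
)

# One row per level, with the number of observations in column n
order_counts <- count(toy, ORDER)
order_counts

# Base R alternative
table(toy$ORDER)
```

With the full data set, count(cl.order, ORDER) would return the bar heights of the plot above.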

3.4.2 A numerical variable

  • Histogram with geom_histogram()
ggplot(cl.order, aes(x = LEN_MC)) +
  geom_histogram(binwidth = 1)

  • Density plot with geom_density()
ggplot(cl.order, aes(x = LEN_MC)) +
  geom_density(linewidth = 0.5)

3.4.3 A numerical and categorical variable

  • Boxplot with geom_boxplot()
ggplot(cl.order, aes(x = ORDER, y = LEN_MC)) +
  geom_boxplot()

  • Density plot using the optional arguments color and/or fill
ggplot(cl.order, aes(x = LEN_MC, fill = ORDER)) +
  geom_density(alpha = 0.5)

  • A barplot with geom_col(); because the individual values are stacked, each bar shows the summed LEN_MC per level of ORDER
ggplot(cl.order, aes(x = ORDER, y = LEN_MC)) +
  geom_col()
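Since geom_col() draws one stacked segment per row, the bars above reflect totals and therefore also the number of observations per group. If the aim is to compare typical clause lengths, one option is to summarise the data first and plot the group means. The following is a sketch under that assumption; it uses the six rows printed by head(cl.order) as toy data, and the names toy, mean_len and mean_len_mc are illustrative.

```r
library(ggplot2)
library(dplyr)

# Toy data: ORDER and LEN_MC of the six rows printed by head(cl.order)
toy <- data.frame(
  ORDER  = c("sc-mc", "mc-sc", "sc-mc", "mc-sc", "sc-mc", "sc-mc"),
  LEN_MC = c(4, 7, 12, 6, 9, 9)
)

# Compute one mean per ordering type
mean_len <- toy %>%
  group_by(ORDER) %>%
  summarise(mean_len_mc = mean(LEN_MC))

# One bar per group; bar height is the mean main clause length
ggplot(mean_len, aes(x = ORDER, y = mean_len_mc)) +
  geom_col()
```

For the toy rows, the means are 6.5 (mc-sc) and 8.5 (sc-mc); the values for the full data set will differ.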

3.4.4 Two categorical variables

  • Barplots with the fill argument
ggplot(cl.order, aes(x = ORDER, fill = SUBORDTYPE)) +
  geom_bar(position = "dodge")

3.4.5 Two numerical variables

  • Scatterplot with geom_point() (cf. 3.3.1)
ggplot(cl.order, aes(x = LEN_MC, y = LEN_SC)) +
  geom_point()

  • Line plot with geom_line() (the example is based on the flights data set from the previous session)
# Share of flights without a recorded departure time, per scheduled departure hour
flights2 <- nycflights13::flights %>%
  group_by(hour = sched_dep_time %/% 100) %>%
  summarize(prop_cancelled = mean(is.na(dep_time)), n = n()) %>%
  filter(hour > 1)

ggplot(flights2, aes(x = hour, y = prop_cancelled)) +
  geom_line(color = "grey50") +
  geom_point()

3.4.6 Multivariate plots

  • Advanced scatterplot with four variables: LEN_MC (x), LEN_SC (y), ORDER (colour) and SUBORDTYPE (shape)
# 4 variables
ggplot(cl.order, aes(x = LEN_MC, y = LEN_SC)) +
  geom_point(aes(color = ORDER, shape = SUBORDTYPE))

  • Facets
# 5 variables
ggplot(cl.order, aes(x = LEN_MC, y = LEN_SC)) +
  geom_point(aes(color = ORDER, shape = SUBORDTYPE)) +
  facet_wrap(~MORETHAN2CL)

3.4.7 Saving your plot

  • Save the last plot displayed in the viewer to the figures folder inside your working directory (create the folder first if it does not exist):
ggplot(cl.order, aes(x = LEN_MC, y = LEN_SC)) +
  geom_point()

ggsave("figures/clause_length_plot.png")
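ggsave() also accepts a stored plot object and explicit dimensions, which makes the result independent of whatever was displayed last. The sketch below writes to a temporary folder so that it runs anywhere; the folder, the object names and the chosen width, height and resolution are assumptions made for illustration, not recommendations from the source.

```r
library(ggplot2)

# Toy data: LEN_MC and LEN_SC of the first four rows printed by head(cl.order)
toy <- data.frame(LEN_MC = c(4, 7, 12, 6),
                  LEN_SC = c(10, 6, 7, 15))

# Store the plot in an object instead of relying on the last plot shown
p <- ggplot(toy, aes(x = LEN_MC, y = LEN_SC)) +
  geom_point()

# Hypothetical output folder; created only if it does not exist yet
out_dir <- file.path(tempdir(), "figures")
dir.create(out_dir, showWarnings = FALSE)
out_file <- file.path(out_dir, "clause_length_plot.png")

# Width and height are in inches by default; dpi sets the resolution
ggsave(out_file, plot = p, width = 6, height = 4, dpi = 300)

file.exists(out_file)
```

In a project, out_dir would typically be a folder such as "figures" inside the working directory, as in the example above.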